Practical NGS processing and Analysis

Andrea Bultó Sánchez

2024-02-08

Introduction

The aim of this practical is to extend the hands-on to the comparison of brain and liver data of mouse embryo samples at 12.5 days of development.

First, run the container:

docker run -i -t -v /Users/andreab/Downloads/tutorial:/tutorial -w /tutorial ceciliaklein/teaching:uvic

After running this command, we can run the executable files in /tutorial/teaching-utils/ from anywhere in the terminal session without specifying the full path to the executable.

export PATH=$PATH:/tutorial/teaching-utils/;

For the following report we will create a new folder to store all outputs:

mkdir tutorial/final_task
mkdir tutorial/final_task/Q1 tutorial/final_task/Q2 tutorial/final_task/Q3 tutorial/final_task/Q4 tutorial/final_task/Q3 tutorial/final_task/Q5

Q1

Perform differential expression analysis between brain and liver using the EdgeR. Present results using a heatmap with hierarchical clustering in rows and columns and colored classification of differentially expressed genes (DEGs), i.e. overexpressed in brain versus liver and the other way around.

cd final_task/Q1 

We choose coefficient 3 to compare brain and liver.

edgeR.analysis.R --input_matrix /tutorial/quantifications/encode.mouse.gene.expected_count.idr_NA.tsv \
                 --metadata /tutorial/data/gene.quantifications.index.tsv \
                 --fields tissue \
                 --coefficient 3 \
                 --output brain_X_liver

We write a list of the genes overexpressed in brain compared to liver, according to edgeR analysis:

awk '$NF<0.01 && $2<-10{print $1"\tover_brain_X_liver"}' edgeR.cpm1.n2.brain_X_liver.tsv > edgeR.0.02.over_brain_X_liver.txt #118

And genes overexpressed in liver compared to brain:

awk '$NF<0.01 && $2>10 {print $1"\tover_liver_X_brain"}' edgeR.cpm1.n2.brain_X_liver.tsv > edgeR.0.02.over_liver_X_brain.txt #107

Count how many overexpressed genes there are in each stage:

wc -l edgeR.0.02.over*.txt

118 edgeR.0.02.over_brain_X_liver.txt
107 edgeR.0.02.over_liver_X_brain.txt
225 total

Getting all DEGs of brain and liver with their gene type:

awk '$3=="gene"{ match($0, /gene_id "([^"]+).+gene_type "([^"]+)/, var); print var[1],var[2] }' OFS="\t" /tutorial/refs/gencode.vM4.gtf \
| join.py --file1 stdin \
          --file2 <(cat edgeR.0.02.over*.txt) \
| sed '1igene\tedgeR\tgene_type' > gene.edgeR_BL.tsv

Perform a Heatmap:

# i copied the matrix from the TPM of the three tissues to a new file which i edited so i could choose only those fields for Brain and Liver
awk 'NR==1{print "-" "\t" $0; next} {print}' /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA.tsv |cut -f1,2,3,6,7 > /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL.tsv  # i shifted the first row to the right (i had to add '-' to the first field of the first row) and selected only those fields for Brain and Liver
awk 'NR == 1 {sub(/^-[\t]/, ""); print} NR > 1' /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL.tsv > /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL_fixed.tsv # i removed the first field '-' in the first row

# for the heatmap i chose the matrix file for Brain and Liver TPM      
 cut -f1 gene.edgeR_BL.tsv | tail -n+2 | selectMatrixRows.sh - /tutorial/quantifications/encode.mouse.gene.TPM.idr_NA_BL_fixed.tsv \
| ggheatmap.R --width 5 \
              --height 8 \
              --col_metadata /tutorial/data/temp_metadata.tsv \
              --colSide_by tissue \
              --col_labels labExpId \
              --row_metadata gene.edgeR_BL.tsv \
              --merge_row_mdata_on gene \
              --rowSide_by edgeR,gene_type \
              --row_labels none \
              --log \
              --pseudocount 0.1 \
              --col_dendro \
              --row_dendro \
              --matrix_palette /tutorial/palettes/palDiverging.txt \
              --colSide_palette /tutorial/palettes/palTissue2.txt \
              --output heatmap.brain_X_liver.pdf

Heatmap Brain_Liver

Q2

Perform gene ontology enrichment analysis of the two sets of DEGs using the command line wrapper of GOstats R package for biological processes. Plot results using any graphical representation and discuss results.

cd ../Q2

Prepare a file with the genes in the annotation:

awk '{split($10,a,/\"|\./); print a[2]}' /tutorial/refs/gencode.vM4.gtf | sort -u > universe.txt

Launch the GO enrichment script for the Biological Processes in the set of genes overexpressed in brain and liver:

The results can be visualized in a web browser, using the online Gene Ontology visualizer REVIGO.

for file in ../Q1/edgeR.0.02.over_liver_X_brain.txt ../Q1/edgeR.0.02.over_brain_X_liver.txt; do awk '{split($1,a,"."); print a[1]}' $file\
| GO_enrichment.R --universe universe.txt \
                  --genes stdin \
                  --categ BP \
                  --output $(basename ${file})_output \
                  --species mouse ;done

We copy the command output and paste it in the text box on the main page of the REVIGO web site.

awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.0.02.over_brain_X_liver.txt_output.BP.tsv
awk 'NR==1{$1="% "$1}{print $1,$2}' edgeR.0.02.over_liver_X_brain.txt_output.BP.tsv

GO enrichment for Biological Processes - Brain-Liver

GO enrichment for Biological Processes - Liver-Brain

Q3

Analyze differential splicing using SUPPA between brain and liver for skipping exon, intron retention, mutually exclusive exon and alternative first exon. Plot top results using heatmaps. Different thresholds may be chosen for each event type.

cd ../Q3

Types of events: Skipping Exon (SE), Retained intron (RI) , Mutually Exclusive Exons (MX), Alternative First (AF)

(FL : Alternative first (AF) and last (AL) exons (it generates both))

suppa.py generateEvents -i /tutorial/splicing/exon-annot.gtf -e SE RI MX FL -o localEvents -f ioe

Compute PSI for SE, RI, MX, AF:

for event in SE RI MX AF; do suppa.py psiPerEvent --total-filter 10 --ioe-file localEvents_${event}_strict.ioe --expression-file /tutorial/splicing/pc-tx.tsv -o PSI-${event}; done

Create individual files per tissue and event:

for event in SE RI MX AF; do for tissue in Brain Liver ;do echo -1 "Processing event $event and $tissue tissue"; selectMatrixColumns.sh PRNAembryo${tissue}1:PRNAembryo${tissue}2 PSI-${event}.psi > ${tissue}.${event}.psi;done;done

Compute differential splicing analysis for each event, comparing both tissues against all and performing multiple testing correction:

for event in SE RI MX AF; do
    suppa.py diffSplice --method empirical \
                        --input localEvents_${event}_strict.ioe \
                        --psi Brain.${event}.psi  Liver.${event}.psi \
                        --tpm /tutorial/splicing/expr.Brain.tsv /tutorial/splicing/expr.Liver.tsv \
                        -c -gc -o DS.${event} ; done

For SE and AF pval 0.05 and deltaPSI 0.5

for event in SE AF; do awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.5 || $2<-0.5) && $3<0.05{print}' DS.${event}.dpsi |cut -f1 |grep -v Brain > top.${event}.Brain-Liver.txt ;done

For RI and MX pval 0.05 and deltaPSI 0.3

for event in RI MX ; do awk 'BEGIN{FS=OFS="\t"}NR==1{print }NR>1 && $2!="nan" && ($2>0.3 || $2<-0.3) && $3<0.05{print}' DS.${event}.dpsi |cut -f1 |grep -v Brain > top.${event}.Brain-Liver.txt ;done

Generate a heatmap for event:

for event in SE RI MX AF; do selectMatrixRows.sh top.${event}.Brain-Liver.txt DS.${event}.psivec > top.${event}.Brain-Liver.tsv; done
for event in SE RI MX AF; do ggheatmap.R -i top.${event}.Brain-Liver.tsv \
  -o heatmap.${event}.psi.Brain-Liver.pdf \
  --matrix_palette /tutorial/palettes/blue.9.txt \
  --row_dendro \
  --col_dendro \
  --base_size 10; done

Heatmap Skipping Exon - Brain-Liver Heatmap Retained intron - Brain-Liver Heatmap Mutually Exclusive Exons - Brain-Liver Heatmap Alternative First - Brain-Liver

Q4

Find H3K4me3 peaks shared by brain and liver and the ones exclusively found in each tissue using the narrow peaks found in /tutorial/results using bedtools intersect. Show results using a bar plot colored by the color code used during the hands-on. Palette is availables at /tutorial/palettes/palTissue.txt. Any color may be chosen for shared peaks.

cd ../Q4

Have a look at the number of peaks called per tissue:

wc -l /tutorial/results/*narrowPeak

23467 /tutorial/results/CHIPembryoBrain.narrowPeak
25640 /tutorial/results/CHIPembryoLiver.narrowPeak
49107 total

Perform bedtools intersect:

#common peaks brain - liver
bedtools intersect -a /tutorial/results/CHIPembryoBrain.narrowPeak -b /tutorial/results/CHIPembryoLiver.narrowPeak |cut -f1-3 > brain-liver_common_peaks.bed #19356

# brain-specific
bedtools intersect -a /tutorial/results/CHIPembryoBrain.narrowPeak -b /tutorial/results/CHIPembryoLiver.narrowPeak -v |cut -f1-3 > brain-specific_peaks.bed #4812

# liver specific
bedtools intersect -a /tutorial/results/CHIPembryoLiver.narrowPeak -b /tutorial/results/CHIPembryoBrain.narrowPeak -v |cut -f1-3 > liver-specific_peaks.bed #7113
{
echo -ne "Brain-Specific\t" && wc -l < brain-specific_peaks.bed
echo -ne "Brain-Liver_Common\t" && wc -l < brain-liver_common_peaks.bed
echo -ne "Liver-Specific\t" && wc -l < liver-specific_peaks.bed
} > brain_liver_peak_counts.tsv

ggbarplot.R -i brain_liver_peak_counts.tsv -o barplot_peaks_brain_liver.pdf --title 'H3K4me3 peaks In Brain And Liver' --y_title 'Peak Count' --x_title "Tissue" --palette_fill /tutorial/palettes/palTissue_peaks.txt -f 1

palTissue_peaks.txt was adjusted so the color matched with the tissue palette.

Barplot peak counts

Q5

Create a BED file of 200bp up/downstream TSS of genes and overlap DEGs (step 1) with the 3 sets of H3K4me3 peaks classified in the previous step (4). Show three examples in the UCSC genome browser, including RNA-seq, ChIP-seq and ATAC-seq tracks. Ideally, one example of each peak set (i.e. shared peak, peak exclusively called in brain and peak exclusively called in liver). Discuss the integration of the three datasets in the TSS of the selected cases.

cd ../Q5

Getting genes that are protein-coding from the gene annotation +-200bp up/downstream TSS:

awk '$3 == "gene" && /gene_type "protein_coding"/ {print $0}' /tutorial/refs/gencode.vM4.gtf | awk 'OFS="\t" {if($7=="+") {start=$4-200; if(start<0) start=0; print $1,start,$4+200,$10,"+"} else if($7=="-") {print $1,$5-200,$5+200,$10,"-"}}' | sed 's/[";]//g' > protein_coding_TSS_regions.bed

cat protein_coding_TSS_regions.bed |bedtools sort > protein_coding_TSS_regions_sorted.bed 
cut -f1 ../Q1/edgeR.0.02.over_brain_X_liver.txt > brain_DEGs.txt
cut -f1 ../Q1/edgeR.0.02.over_liver_X_brain.txt > liver_DEGs.txt
grep -Ff brain_DEGs.txt protein_coding_TSS_regions_sorted.bed > brain_DEGs_TSS.bed 
grep -Ff liver_DEGs.txt protein_coding_TSS_regions_sorted.bed > liver_DEGs_TSS.bed 
bedtools intersect -a protein_coding_TSS_regions_sorted.bed -b ../Q4/brain-liver_common_peaks.bed > brain-liver_common_peaks_TSS.bed
bedtools intersect -a protein_coding_TSS_regions_sorted.bed -b ../Q4/brain-specific_peaks.bed  > brain-specific_peaks_TSS.bed
bedtools intersect -a protein_coding_TSS_regions_sorted.bed -b ../Q4/liver-specific_peaks.bed  > liver-specific_peaks_TSS.bed

We obtain the coordinates for the genes differentially expressed in each tissue for the common and tissue-specific H3K4me3 peaks.

bedtools intersect -a  brain_DEGs_TSS.bed -b ../Q4/brain-specific_peaks.bed | less #18
bedtools intersect -a  liver_DEGs_TSS.bed -b ../Q4/liver-specific_peaks.bed | less #43 
bedtools intersect -a  brain_DEGs_TSS.bed -b ../Q4/brain-liver_common_peaks.bed | less #49
bedtools intersect -a  liver_DEGs_TSS.bed -b ../Q4/brain-liver_common_peaks.bed | less #4 

UCSC browser:

Input for tracks:

RNAseq:

track autoScale=on name=RNAseq,brain,PRNAembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain1.bw
track autoScale=on name=RNAseq,brain,PRNAembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoBrain2.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver1.bw
track autoScale=on name=RNAseq,liver,PRNAembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/PRNAembryoLiver2.bw

ChIP-seq:

track autoScale=on name=ChIP-seq,brain,CHIPembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain1.bw
track autoScale=on name=ChIP-seq,brain,CHIPembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoBrain2.bw
track autoScale=on name=ChIP-seq,liver,CHIPembryoLiver1 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver1.bw
track autoScale=on name=ChIP-seq,liver,CHIPembryoLiver2 color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/CHIPembryoLiver2.bw

ATAC-seq:

track autoScale=on name=ATAC-seq,brain,ATACembryoBrain1 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain1.bw
track autoScale=on name=ATAC-seq,brain,ATACembryoBrain2 color=204,204,0 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoBrain2.bw
track autoScale=on name=ATAC-seq,liver,ATACembryoLiver1, color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver1.bw
track autoScale=on name=ATAC-seq,liver,ATACembryoLiver2, color=107,142,35 viewLimits=0:80 maxHeightPixels=30 type=bigWig visibility=2 bigDataUrl=https://public-docs.crg.es/rguigo/Data/cklein/courses/UVIC/UCSC/ATACembryoLiver2.bw

Brain specific peaks v Brain-DEGs:
Example: ENSMUSG00000047171.7 chr8:70120673-70121073
chr8:70120673-70121073.Brain

Liver specific peaks v Liver-DEGs:
Example: ENSMUSG00000026295.8 chr1:88407051-88407161
chr1:88407051-88407161.Liver

Common specific peaks v Brain-DEGs:
Example: chr10 58813159 58813559 ENSMUSG00000037990.14
chr10,58813159,58813559.Common

Q6

Show two examples of alternative first exons in the UCSC genome browser, including RNA-seq, ChIP-seq and ATAC-seq tracks. Discuss the integration of the three datasets in the TSS of the selected cases.

From the AF file generated in step 3:

less ../Q3/top.AF.Brain-Liver.txt

Example: ENSMUSG00000002345.13;AF:chr8:70139707:70139772-70141864:70140277:70140356-70141864:+
AF+

Example: ENSMUSG00000029922.11;AF:chr6:39410216-39419840:39420067:39410216-39420099:39420358:-
AF-2